ABSTRACT

This article studies the application of ridge regression to multicollinear data, with the ridge parameter
determined using bootstrap samples. The mean squared errors of the samples, together with some arbitrary values, were
used to determine the ridge parameter that gives the minimum residual. The results show that both the mean squared error
and the smallest eigenvalue of the predictor variables of the original data play a vital role in determining the ridge
parameter of ridge regression.

1.0 INTRODUCTION

The observations in many experiments in the physical and medical sciences are often ill-conditioned. Their
distribution can be highly skewed, they can have tails thicker than those of the normal distribution, and random samples
often have outliers and correlated variables. Outliers, correlated variables and heavy-tailed distributions are serious
problems because they inflate the standard errors of the estimates, causing tests based on them to have relatively low
power. In regression analysis, when the predictor variables $X = (X_1, \ldots, X_p)$ have an $X'X$ matrix of full rank,
the least squares (LS) estimators are unbiased. The Gauss-Markov property assures us that the LS estimator has minimum
variance in the class of unbiased linear estimators. Ordinary least squares (OLS) regression is, however, affected by
outliers, multicollinearity, and skewed or heavy-tailed distributions. When the method of least squares is applied to
nonorthogonal variables, very poor estimates of the regression coefficients are usually obtained. The variance of the
estimates of the regression coefficients may be considerably inflated, and the vector of coefficients is too long on
average and very unstable. One common technique for overcoming this difficulty when the data are ill-conditioned is to
drop the requirement that the estimators of the regression coefficients be unbiased. Ridge regression is one such biased
estimator; it is applied to data whose predictor variables have an $X'X$ matrix that is ill-conditioned (near singular)
or even singular (has zero eigenvalues). In this article ridge regression is applied to multicollinear real data, with
the ridge parameter determined using bootstrap samples.

2.0 RIDGE REGRESSION 

A method of regression analysis that is effective in the presence of multicollinearity was proposed by Hoerl and
Kennard (1970) and is called ridge regression. Assuming that the data (X and Y) have been standardized, they suggested
adding a constant k to the diagonal elements of the $X'X$ matrix of the predictor variables, so that the regression
coefficients are estimated from the modified normal equations, hereafter called the ridge equation

$$(X'X + kI)\,b(k) = X'y \qquad (1)$$

from which the regression coefficients can be estimated.

Making b(k) the subject of the formula in (1) gives 

$$b(k) = (X'X + kI)^{-1} X'y \qquad (2)$$

with $k \ge 0$, a nonstochastic quantity, being the ridge (or control) parameter. Of course $b(0) = b$ is the OLS
estimate. The vector $b(k) = [b_1(k), \ldots, b_p(k)]'$ contains the estimates of the parameters in the non-intercept
part of the model.
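
As a minimal sketch of equation (2), the ridge estimate can be computed by solving the ridge equation directly rather than forming the inverse. The function name `ridge_estimate` and the simulated, nearly collinear data below are assumptions made only for illustration; they are not the data analysed in this article.

```python
import numpy as np

def ridge_estimate(X, y, k):
    """Solve the ridge equation (X'X + kI) b(k) = X'y for b(k)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Hypothetical, nearly collinear data (not the data analysed in this article).
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 1e-3 * rng.normal(size=50)      # almost a copy of x1
X = np.column_stack([x1, x2])
X = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize the predictors
y = x1 + rng.normal(scale=0.1, size=50)
y = y - y.mean()                          # centre the response

b_ols = ridge_estimate(X, y, 0.0)    # k = 0 reproduces the OLS estimate b
b_ridge = ridge_estimate(X, y, 0.1)  # k > 0 shortens the coefficient vector
```

Comparing `np.linalg.norm(b_ols)` with `np.linalg.norm(b_ridge)` on such data illustrates how a positive k shortens the coefficient vector.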

Premultiplying both sides of (1) by $(X'X)^{-1}$ (assuming $X'X$ is nonsingular) and solving for $b(k)$, we have that

$$b(k) = \left(I + k(X'X)^{-1}\right)^{-1} b \qquad (3)$$

where $b$ denotes the LS estimate in standard form. Since $E(b) = \beta$, equation (3) shows that the ridge estimator is
biased for $k > 0$ and that the amount of bias depends on the ridge (or control) parameter k.

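Using the abbreviation

$$G_k = (X'X + kI)^{-1} \qquad (4)$$

$E(b(k))$, $\mathrm{Bias}(b(k), \beta)$ and $V(b(k))$ can be expressed as follows.

E(b(k))

$$
\begin{aligned}
E(b(k)) &= E\left((X'X + kI)^{-1}X'y\right) \\
        &= (X'X + kI)^{-1}X'E(y) \\
        &= (X'X + kI)^{-1}X'X\beta \\
        &= G_k X'X\beta \qquad (5)
\end{aligned}
$$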

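Bias(b(k), β)

$$
\begin{aligned}
\mathrm{Bias}(b(k), \beta) &= E(b(k) - \beta) \\
  &= G_k X'X\beta - \beta \\
  &= (X'X + kI)^{-1}X'X\beta - \beta \\
  &= (X'X + kI)^{-1}\left[X'X - (X'X + kI)\right]\beta \\
  &= -k\,G_k\beta \qquad (6)
\end{aligned}
$$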

V(b(k))

$$
\begin{aligned}
V(b(k)) &= \mathrm{var}\left((X'X + kI)^{-1}X'y\right) \\
        &= \mathrm{var}(G_k X'y) \\
        &= G_k X'\,\mathrm{var}(y)\,X G_k \\
        &= \sigma^2 G_k X'X G_k \qquad (7)
\end{aligned}
$$

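Hence the Mean Dispersion Error (MDE) matrix is

$$
\begin{aligned}
M(b(k), \beta) &= E(b(k) - \beta)(b(k) - \beta)' \\
  &= E\left[(b(k) - E(b(k))) + (E(b(k)) - \beta)\right]\left[(b(k) - E(b(k))) + (E(b(k)) - \beta)\right]' \\
  &= E\big[\{b(k) - E(b(k))\}\{b(k) - E(b(k))\}' + \{b(k) - E(b(k))\}\{E(b(k)) - \beta\}' \\
  &\qquad + \{E(b(k)) - \beta\}\{b(k) - E(b(k))\}' + \{E(b(k)) - \beta\}\{E(b(k)) - \beta\}'\big] \\
  &= V(b(k)) + \mathrm{Bias}(b(k), \beta)\,\mathrm{Bias}(b(k), \beta)' \\
  &= \sigma^2 G_k X'X G_k + (-k G_k\beta)(-k G_k\beta)' \\
  &= G_k\left(\sigma^2 X'X + k^2\beta\beta'\right)G_k \qquad (8)
\end{aligned}
$$

The two cross terms vanish because $E\{b(k) - E(b(k))\} = 0$.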

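From the spectral decomposition of the symmetric matrix $X'X$ we have that

$$X'X = P\Lambda P', \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$$

$$G_k^{-1} = X'X + kI = P\Lambda P' + kPP' = P(\Lambda + kI)P'$$

Note that $PP' = I$. Hence

$$G_k = P(\Lambda + kI)^{-1}P'$$

Therefore

$$M(b(k), \beta) = P(\Lambda + kI)^{-1}\left(\sigma^2\Lambda + k^2 P'\beta\beta'P\right)(\Lambda + kI)^{-1}P'$$

and, writing $\alpha = P'\beta$ for the coefficient vector in canonical form,

$$\mathrm{tr}\,M(b(k), \beta) = \sum_{i=1}^{p}\frac{\sigma^2\lambda_i + k^2\alpha_i^2}{(\lambda_i + k)^2} \qquad (9)$$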

The scalar MDE of $b(k)$ for fixed $\sigma^2$ and a fixed vector $\beta$ is a function of the ridge parameter k, which
starts at $\sum_{i=1}^{p}\sigma^2/\lambda_i = \mathrm{tr}(V(b))$ for $k = 0$, takes its minimum for $k = k_{opt}$ and
then increases monotonically, provided that $k_{opt} < \infty$
(Rao and Toutenburg 1995).
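
The behaviour of the scalar MDE as a function of k can be checked numerically. The sketch below evaluates it in two ways, from the eigenvalues of $X'X$ and from the trace of the matrix form $G_k(\sigma^2 X'X + k^2\beta\beta')G_k$, so that the two can be compared. The function names and the small $2 \times 2$ example are hypothetical and chosen only for illustration.

```python
import numpy as np

def scalar_mde(XtX, beta, sigma2, k):
    """Scalar MDE of b(k), computed from the eigenvalues of X'X."""
    lam, P = np.linalg.eigh(XtX)   # X'X = P diag(lam) P'
    alpha = P.T @ beta             # beta expressed in canonical coordinates
    return float(np.sum((sigma2 * lam + k**2 * alpha**2) / (lam + k) ** 2))

def scalar_mde_matrix(XtX, beta, sigma2, k):
    """Same quantity as the trace of G_k (sigma^2 X'X + k^2 beta beta') G_k."""
    G = np.linalg.inv(XtX + k * np.eye(len(beta)))
    return float(np.trace(G @ (sigma2 * XtX + k**2 * np.outer(beta, beta)) @ G))

# Hypothetical ill-conditioned example (values chosen only for illustration).
XtX = np.array([[1.0, 0.99], [0.99, 1.0]])
beta = np.array([1.0, 1.0])
sigma2 = 1.0

mde_ols = scalar_mde(XtX, beta, sigma2, 0.0)    # sigma^2 * sum(1 / lambda_i)
mde_ridge = scalar_mde(XtX, beta, sigma2, 0.1)  # smaller than mde_ols for this example
```

Evaluating `scalar_mde` over a grid of k values is one way to locate the minimum numerically when $\sigma^2$ and $\beta$ are taken as known.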

We now transform $M(b, \beta) = M(b) = \sigma^2(X'X)^{-1}$ as follows:

$$
\begin{aligned}
M(b) &= \sigma^2 G_k\left(G_k^{-1}(X'X)^{-1}G_k^{-1}\right)G_k \\
     &= \sigma^2 G_k\left[(X'X + kI)(X'X)^{-1}(X'X + kI)\right]G_k \\
     &= \sigma^2 G_k\left[(X'X)(X'X)^{-1}(X'X) + (X'X)(X'X)^{-1}kI + kI(X'X)^{-1}(X'X) + kI(X'X)^{-1}kI\right]G_k \\
     &= \sigma^2 G_k\left(X'X + k^2(X'X)^{-1} + 2kI\right)G_k \qquad (10)
\end{aligned}
$$

Definition 1: Let $\hat\beta_1$ and $\hat\beta_2$ be two estimators of $\beta$. Then $\hat\beta_2$ is called MDE-superior
to $\hat\beta_1$ (or $\hat\beta_2$ is called an MDE-improvement to $\hat\beta_1$) if the difference of their MDE matrices
is nonnegative definite, that is, if

$$\Delta(\hat\beta_1, \hat\beta_2) = M(\hat\beta_1, \beta) - M(\hat\beta_2, \beta) \ge 0$$

From Definition 1 we obtain the interval $0 < k < k^*$ in which the ridge estimator is MDE-superior to the OLS
estimator $b = (X'X)^{-1}X'y$:

$$
\begin{aligned}
\Delta(b, b(k)) &= M(b) - M(b(k), \beta) \\
  &= \sigma^2 G_k\left(X'X + k^2(X'X)^{-1} + 2kI\right)G_k - G_k\left(\sigma^2 X'X + k^2\beta\beta'\right)G_k \\
  &= \sigma^2 G_k X'X G_k + \sigma^2 k^2 G_k(X'X)^{-1}G_k + 2k\sigma^2 G_k G_k - \sigma^2 G_k X'X G_k - k^2 G_k\beta\beta' G_k \\
  &= k\,G_k\left[\sigma^2\left(2I + k(X'X)^{-1}\right) - k\beta\beta'\right]G_k \qquad (11)
\end{aligned}
$$

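Since $G_k > 0$, we have that $\Delta(b, b(k)) \ge 0$ if and only if

$$\sigma^2\left(2I + k(X'X)^{-1}\right) - k\beta\beta' \ge 0$$

that is,

$$\sigma^2\left(2I + k(X'X)^{-1}\right) \ge k\beta\beta' \qquad (12)$$

Because the matrix on the left-hand side of (12) is positive definite and $k\beta\beta'$ has rank one, (12) holds if and
only if

$$\sigma^{-2}k\,\beta'\left(2I + k(X'X)^{-1}\right)^{-1}\beta \le 1 \qquad (13)$$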

As a sufficient condition for (12), independent of the model matrix, we obtain

$$2\sigma^2 I - k\beta\beta' \ge 0$$

$$2\sigma^2 I \ge k\beta\beta'$$

or, equivalently,

$$k \le \frac{2\sigma^2}{\beta'\beta} \qquad (14)$$

The range of k which ensures the MDE-I superiority of $b(k)$ compared to $b$ is dependent on $\sigma^{-1}\beta$ and hence
unknown. If auxiliary information about the length (norm) of $\beta$ is available in the form

$$\beta'\beta \le r^2$$

then

$$k \le \frac{2\sigma^2}{r^2} \qquad (15)$$

is sufficient for (14) to be valid. Hence possible values for k, for which $b(k)$ is better than $b$, can be found by
estimation of $\sigma^2$ or by specification of a lower limit or by a combined a priori estimation
$\sigma^{-2}\beta'\beta \le r^2$.
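
The sufficient condition can be illustrated numerically. The sketch below builds the difference of the MDE matrices of $b$ and $b(k)$ for a value of k inside the bound $2\sigma^2/\beta'\beta$ and inspects its smallest eigenvalue, which should be nonnegative. The matrix $X'X$, the vector $\beta$ and $\sigma^2$ are hypothetical values assumed known purely for the purpose of the check; in practice they are unknown.

```python
import numpy as np

def mde_difference(XtX, beta, sigma2, k):
    """Difference M(b) - M(b(k), beta) of the MDE matrices of OLS and ridge."""
    p = len(beta)
    G = np.linalg.inv(XtX + k * np.eye(p))
    M_ols = sigma2 * np.linalg.inv(XtX)
    M_ridge = G @ (sigma2 * XtX + k**2 * np.outer(beta, beta)) @ G
    return M_ols - M_ridge

XtX = np.array([[1.0, 0.9], [0.9, 1.0]])   # hypothetical X'X
beta = np.array([2.0, -1.0])               # hypothetical true coefficients
sigma2 = 1.0                               # hypothetical error variance

k_bound = 2.0 * sigma2 / (beta @ beta)     # sufficient bound on k
delta = mde_difference(XtX, beta, sigma2, 0.5 * k_bound)
min_eig = np.linalg.eigvalsh(delta).min()  # nonnegative when b(k) is MDE-superior
```

Because the bound is only sufficient, values of k somewhat larger than `k_bound` may still give a nonnegative definite difference.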

Swamy et al. (1978) and Swamy and Mehta (1977) investigated the following problem:

$$\min_{\beta}\left\{\sigma^{-2}(y - X\beta)'(y - X\beta) \;\middle|\; \beta'\beta \le r^2\right\}$$

The solution to this problem,

$$\hat\beta(\mu) = (X'X + \sigma^2\mu I)^{-1}X'y \qquad (16)$$

is once again a ridge estimate, and $\hat\beta(\mu)'\hat\beta(\mu) = r^2$ is fulfilled. Replacing $\sigma^2$ by the
estimate $s^2$ provides a practical solution for the estimator (16), but its properties can only be calculated
approximately.

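Hoerl and Kennard (1970) derived the ridge estimator by the following reasoning. Let $\hat\beta$ be any estimator and
$b = (X'X)^{-1}X'y$ the OLS estimator. Then the error sum of squares estimated with $\hat\beta$ can be expressed,
according to the property of optimality of $b$, as

$$
\begin{aligned}
S(\hat\beta) &= (y - X\hat\beta)'(y - X\hat\beta) \qquad (17) \\
  &= \left((y - Xb) + X(b - \hat\beta)\right)'\left((y - Xb) + X(b - \hat\beta)\right) \\
  &= (y - Xb)'(y - Xb) + (b - \hat\beta)'X'X(b - \hat\beta) + 2(y - Xb)'X(b - \hat\beta) \\
  &= S(b) + \Phi(\hat\beta) \qquad (18)
\end{aligned}
$$

noting that the term

$$
\begin{aligned}
2(y - Xb)'X(b - \hat\beta) &= 2\left(y - X(X'X)^{-1}X'y\right)'X(b - \hat\beta) \\
  &= 2y'\left(I - X(X'X)^{-1}X'\right)X(b - \hat\beta) \\
  &= 2y'MX(b - \hat\beta) = 0
\end{aligned}
$$

since $MX = 0$, where $M = I - X(X'X)^{-1}X'$.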



Let $\Phi_0 > 0$ be a fixed given value for the error sum of squares. Then a set $\{\hat\beta\}$ of estimates exists
that fulfil the condition $S(\hat\beta) = S(b) + \Phi_0$. In this set we look for the estimate with minimal length:

$$\min_{\hat\beta}\left\{\hat\beta'\hat\beta + \frac{1}{k}\left[(b - \hat\beta)'X'X(b - \hat\beta) - \Phi_0\right]\right\} \qquad (19)$$

where $1/k$ is a Lagrangian multiplier. Differentiation of this function with respect to $\hat\beta$ and $1/k$ leads to
the normal equations.

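Let

$$L = \hat\beta'\hat\beta + \frac{1}{k}\left[(b - \hat\beta)'X'X(b - \hat\beta) - \Phi_0\right]$$

Then

$$\frac{\partial L}{\partial\hat\beta} = 2\hat\beta - \frac{2}{k}X'X(b - \hat\beta) = 0$$

$$2\hat\beta + \frac{2}{k}X'X(\hat\beta - b) = 0$$

Since $2 \ne 0$,

$$\hat\beta + \frac{1}{k}X'X(\hat\beta - b) = 0$$

$$\hat\beta + \frac{1}{k}X'X\hat\beta = \frac{1}{k}X'Xb$$

$$\left(\frac{1}{k}X'X + I\right)\hat\beta = \frac{1}{k}X'Xb$$

$$\hat\beta = (X'X + kI)^{-1}X'Xb = G_k X'Xb \qquad (20)$$

and

$$\frac{\partial L}{\partial(1/k)} = (b - \hat\beta)'X'X(b - \hat\beta) - \Phi_0 = 0$$

$$\Phi_0 = (b - \hat\beta)'X'X(b - \hat\beta) \qquad (21)$$

Hence the solution of (19) is the ridge estimator $\hat\beta = b(k)$ in (20), since $X'Xb = X'y$.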

Equivalently, the ridge estimator $b(k)$ minimizes a penalized residual sum of squares:

$$b(k) = \arg\min_{\beta}\left[\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p}x_{ij}\beta_j\Big)^2 + k\sum_{j=1}^{p}\beta_j^2\right] \qquad (22)$$

The ridge parameter k is to be determined iteratively so that (21) is fulfilled.
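
The link between the closed form and the penalized criterion can be verified on simulated data. The following sketch, which uses hypothetical data and an arbitrary k, checks that the gradient of the penalized residual sum of squares vanishes at $b(k)$ and computes the increase in the error sum of squares over OLS that this k implies. The names `penalized_rss`, `grad` and `phi0` are introduced here for illustration only.

```python
import numpy as np

def penalized_rss(beta, X, y, k):
    """Residual sum of squares plus k times the squared length of beta."""
    resid = y - X @ beta
    return float(resid @ resid + k * (beta @ beta))

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                     # hypothetical design matrix
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=30)
k = 0.7                                          # arbitrary ridge parameter

# Closed-form ridge estimate and the gradient of the penalized criterion at it.
b_k = np.linalg.solve(X.T @ X + k * np.eye(3), X.T @ y)
grad = -2 * X.T @ (y - X @ b_k) + 2 * k * b_k

# Increase in the error sum of squares over OLS implied by this k.
b = np.linalg.solve(X.T @ X, X.T @ y)
phi0 = float((b - b_k) @ (X.T @ X) @ (b - b_k))
```

Determining k iteratively for a prescribed $\Phi_0$ would amount to adjusting k until `phi0` matches the prescribed value; no particular search method is implied by the text.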

Consider $\hat y(k) = Xb(k)$ to be the estimated y:

$$
\begin{aligned}
\hat y(k) &= X(X'X + kI)^{-1}X'y \\
          &= X(X'X)^{-1}\left\{I + k(X'X)^{-1}\right\}^{-1}X'y \qquad (23)
\end{aligned}
$$

The sum of the squares of the deviations of the y's from their fitted values is

$$(y - \hat y(k))'(y - \hat y(k)) \qquad (24)$$
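
A short sketch of how this sum of squares can be evaluated for several values of k is given below. The data are simulated and the function name `ridge_rss` is hypothetical; only the grid of k values is taken from this article (it is the grid later used in Table 2).

```python
import numpy as np

def ridge_rss(X, y, k):
    """Sum of squared deviations of y from the ridge fitted values X b(k)."""
    b_k = np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)
    resid = y - X @ b_k
    return float(resid @ resid)

# Hypothetical collinear predictors: three noisy copies of one variable.
rng = np.random.default_rng(2)
z = rng.normal(size=40)
X = np.column_stack([z + 0.05 * rng.normal(size=40) for _ in range(3)])
y = X.sum(axis=1) + rng.normal(scale=0.2, size=40)

k_values = [0.0, 0.00001, 0.006, 0.05, 1.0]   # grid of k values used in Table 2
rss = [ridge_rss(X, y, k) for k in k_values]
```

Since the OLS estimate minimizes the error sum of squares, the values in `rss` do not decrease as k grows; the smallest residual alone therefore cannot single out a positive k.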

2.1 BOOTSTRAP 

Bootstrapping describes how sample data can be used to obtain reliable standard errors, confidence intervals and
other measures of uncertainty for a wide range of problems. The key idea is to resample from the original data, either
directly or via a fitted model, to create replicate datasets from which the variability of the quantity of interest can
be assessed without long-winded and error-prone analytical calculations. The process involves repeating the original data
analysis procedure with many replicate sets of data. The initial reaction was that resampling from the original data is
a fraud, but in fact it is not (Davison and Hinkley 1985). It turns out that a wide range of statistical problems can be
tackled this way, liberating the investigator from the need to oversimplify complex problems. Bootstrap methods are
intended to help avoid tedious calculations based on questionable assumptions, but they cannot replace clear critical
thought about the problem, appropriate design of the investigation and data analysis, and incisive presentation of the
conclusions. The methods can be applied both when there is a well-defined probability model for the data and when there
is not. There are four resampling schemes under the bootstrap: classical bootstrap, smooth bootstrap, wild bootstrap,
and residual-based (Bayesian) bootstrap (Hall and Mammen 1994). In this article we use the classical bootstrap to form
samples from which different values of the mean squared error are generated.

The classical bootstrap may be thought of as a device for constructing a new data sequence of the same size as
the original sample. All the members of the new sequence are drawn from the original sample, and are present in
proportions determined by a uniform multinomial distribution on the original values. The latter distribution is a
consequence of the "random sampling with replacement" concept that underlies the classical bootstrap algorithm. Under
the classical bootstrap, $\{(X_i^*, y_i^*)\}_{1 \le i \le n}$ is drawn at random from the original sample
$\{(X_i, y_i)\}_{1 \le i \le n}$. This resampling method goes back to the pioneering work of Efron (1979).
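
The resampling step described above can be sketched in a few lines. The helper name `bootstrap_pairs` and the toy sample are assumptions for illustration; the sketch draws whole cases $(X_i, y_i)$ with replacement so that each predictor row stays attached to its response.

```python
import numpy as np

def bootstrap_pairs(X, y, rng):
    """Draw n (X_i, y_i) pairs with replacement from the original sample."""
    n = len(y)
    idx = rng.integers(0, n, size=n)   # uniform draws with replacement
    return X[idx], y[idx]

rng = np.random.default_rng(3)
X = np.arange(20, dtype=float).reshape(10, 2)   # hypothetical sample of 10 cases
y = np.arange(10, dtype=float)                  # response equals the case index

X_star, y_star = bootstrap_pairs(X, y, rng)     # one bootstrap sample of size 10
```

Repeating the call five times would give five bootstrap samples of the kind used in the numerical application, although the exact resampling code used for this article is not stated.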

2.1.1 Statistical Error 

The basic idea of bootstrapping is to approximate a quantity $q(f)$, such as $\mathrm{var}(\hat\beta \mid f)$, by the
estimate $q(\hat f)$, where $\hat f$ is either a parametric or a nonparametric estimate of $f$ based on the data
$\{(X_i, y_i)\}_{1 \le i \le n}$. The statistical error is then the difference between $q(\hat f)$ and $q(f)$. Bootstrap
methods aim to make this error as small as possible or to remove it entirely.

3.0 NUMERICAL APPLICATION 

3.1 Data and its Description 

We applied the ridge regression equation to a real data set and to five bootstrap samples generated from it.
Some arbitrary values of k were also considered, to enable us to determine how well the ridge parameter of the ridge
regression can be chosen. The real-life data were obtained from an unpublished B.Sc. research project presented at
the Department of Statistics, Nnamdi Azikiwe University, Awka, by Iteire (2004). The data are from the Nigeria Stock
Exchange and cover its transactions for the period 1991-2007. The data set was chosen because it is ill-conditioned.
The predictor variables studied as affecting the response variable (market capitalization) are share volume index, share
value index, daily average volume, daily average value, number of listed securities, all share index, and number of
listed companies. Below is the correlation matrix of the predictor variables.

Correlations: shar vol, shar val, D Av vol, D.Av.val, N.lis sec., ...

              shar vol  shar val  D Av vol  D.Av.val  N.lis sec  A sha ind.
shar val         0.995
D Av vol         0.999     0.998
D.Av.val         0.993     1.000     0.997
N.lis sec        0.605     0.585     0.596     0.581
A sha ind.       0.923     0.893     0.910     0.883      0.636
N.lis. comp      0.493     0.434     0.469     0.421      0.707       0.700

Observe that multicollinearity is highly pronounced in the correlation matrix above, with two predictor variables
(shar val and D.Av.val) being perfectly correlated. Two of the eigenvalues of the X'X matrix are zero, so zero is the
smallest eigenvalue of the matrix.
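
As a rough check of this diagnosis, the eigenvalues can be computed from the reported correlations. The sketch below rebuilds the symmetric matrix from the lower triangle given above, taking the variables in the order shar vol, shar val, D Av vol, D.Av.val, N.lis sec, A sha ind., N.lis. comp. Because the entries are rounded to three decimals, the computed eigenvalues are only approximations to those of the original X'X matrix, and the smallest may even come out slightly negative.

```python
import numpy as np

# Lower triangle of the reported correlation matrix, with unit diagonal added.
lower = [
    [1.000],
    [0.995, 1.000],
    [0.999, 0.998, 1.000],
    [0.993, 1.000, 0.997, 1.000],
    [0.605, 0.585, 0.596, 0.581, 1.000],
    [0.923, 0.893, 0.910, 0.883, 0.636, 1.000],
    [0.493, 0.434, 0.469, 0.421, 0.707, 0.700, 1.000],
]
p = len(lower)
R = np.zeros((p, p))
for i, row in enumerate(lower):
    R[i, : i + 1] = row
R = R + R.T - np.diag(np.diag(R))   # mirror the lower triangle into the upper one

eigvals = np.linalg.eigvalsh(R)     # eigenvalues in ascending order
smallest = eigvals[0]               # at or below zero, up to rounding
```

An eigenvalue at or near zero means the correlation matrix is singular or nearly so, which is the situation ridge regression is designed for.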

4.0 RESULT 



Table 1: Residual Analysis of Real Life and Bootstrap Samples

                      Unstandardized variables          Standardized variables
Samples               Mean Squared Error   Residual     Mean Squared Error   Residual
Real Data             103202               0.009051     0.9532               60063335
Bootstrap Sample 1    101                  0.011234     0.9532               12738352.5
Bootstrap Sample 2    63867                0.009522     0.7990               87408183.4
Bootstrap Sample 3    576                  3.412094     0.2228               1.26 x 10^8
Bootstrap Sample 4    2154                 0.988904     1.557                2.24 x 10^9
Bootstrap Sample 5    135                  0.011232     0.2469               1.27 x 10^8



The sums of the squared deviations of the y's from their fitted values, as stated in (24), were calculated using
the different values of k obtained from (15). The variances were obtained from the estimated mean squared errors of the
real-life and bootstrap data, before and after the data (X and Y) were standardized. The results of the analysis are
presented in Table 1 above. Table 1 shows that, when the data are not standardized, the biasing parameter is better
estimated from the mean squared error of the original data than from the values obtained from the bootstrap samples. As
one can observe from Table 1, there is no definite order for choosing the appropriate value of k.
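
The procedure described above can be sketched as follows, under explicit assumptions. For each bootstrap sample an OLS mean squared error $s^2$ is computed, a candidate $k = 2s^2/r^2$ is formed from the bound on k with $\sigma^2$ replaced by $s^2$, and the residual sum of squares of the ridge fit on the original data is recorded. The data are simulated, the value of `r2` is a hypothetical prior bound on $\beta'\beta$, and the article does not state how $r^2$ was chosen, so this is an illustration rather than a reproduction of the reported analysis.

```python
import numpy as np

def ols_mse(X, y):
    """Mean squared error s^2 of a least squares fit."""
    n, p = X.shape
    b, *_ = np.linalg.lstsq(X, y, rcond=None)   # also works if X'X is singular
    resid = y - X @ b
    return float(resid @ resid) / (n - p)

def ridge_rss(X, y, k):
    """Sum of squared deviations of y from the ridge fitted values."""
    b_k = np.linalg.solve(X.T @ X + k * np.eye(X.shape[1]), X.T @ y)
    resid = y - X @ b_k
    return float(resid @ resid)

# Hypothetical collinear data standing in for the real data set.
rng = np.random.default_rng(4)
z = rng.normal(size=30)
X = np.column_stack([z + 0.05 * rng.normal(size=30) for _ in range(3)])
y = X.sum(axis=1) + rng.normal(scale=0.3, size=30)
r2 = 3.0   # assumed prior bound on beta'beta (hypothetical)

results = []
for _ in range(5):                               # five bootstrap samples
    idx = rng.integers(0, len(y), size=len(y))   # resample cases with replacement
    k = 2.0 * ols_mse(X[idx], y[idx]) / r2       # candidate k from this sample
    results.append((k, ridge_rss(X, y, k)))      # residual on the original data
```

Each entry of `results` pairs a candidate k with its residual, which is the kind of comparison summarized in Table 1.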

Table 2: Residual Analysis from Arbitrary Values of k

k Value    Residual
0          0.00832
0.00001    0.00832
0.006      0.01114
0.05       0.01173
1.0        0.52339



Table 2 shows, for some selected values of k, the values of the residual sum of squares (17). In Section 3.1 we
mentioned that the smallest eigenvalue of X'X is zero; observing Table 2, one can see that as the value of the biasing
parameter k approaches zero, (15) converges. Values of k that deviate much from the value of the smallest eigenvalue
give large residuals.

5.0 CONCLUSIONS 

Comparing the residuals in Tables 1 and 2, we can state that the biasing parameter of ridge regression may be better
determined through the use of the X'X matrix. It is also observed that the smallest residual in Table 2 is very close to
that of Table 1. That is to say, both the mean squared error and the smallest eigenvalue of the predictor variables of
the original data play a vital role in determining the biasing parameter of ridge regression.